Pull request overview
This pull request adds comprehensive Zeus energy profiling support to measure and track energy consumption during model training and inference. The implementation enables systematic energy efficiency experiments across different model configurations.
Key changes:
- Introduced a reusable `ZeusProfiler` wrapper that gracefully handles optional Zeus installation and provides context managers for energy measurement windows
- Integrated energy profiling into both standalone sampling (`sample.py`) and training workflows (`train.py`), measuring per-sample energy and computing averages
- Extended metrics tracking infrastructure to capture `avg_joules_inf` alongside existing performance metrics
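As a rough illustration of the "optional dependency plus context manager" pattern described above, the following sketch wraps Zeus's `ZeusMonitor` (`begin_window` / `end_window`) so that callers get a no-op window when Zeus is missing or cannot initialize. It is not the PR's `ZeusProfiler`; the names `OptionalEnergyProfiler`, `WindowResult`, and the `monitor` injection parameter are hypothetical and exist only for this example.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional

try:
    # Zeus is optional; fall back to a no-op profiler when it is not installed.
    from zeus.monitor import ZeusMonitor
except Exception:
    ZeusMonitor = None


@dataclass
class WindowResult:
    """Filled in when the window closes; stays None if nothing was measured."""
    joules: Optional[float] = None
    seconds: Optional[float] = None


class OptionalEnergyProfiler:
    """Hypothetical wrapper, not the PR's ZeusProfiler."""

    def __init__(self, enabled: bool = True,
                 gpu_indices: Optional[List[int]] = None,
                 monitor=None):
        # `monitor` lets callers (and tests) inject any object exposing
        # begin_window(name) and end_window(name).
        self._monitor = monitor if enabled else None
        if self._monitor is None and enabled and ZeusMonitor is not None:
            try:
                self._monitor = ZeusMonitor(gpu_indices=gpu_indices)
            except Exception:
                # For example, no supported device: degrade to a no-op.
                self._monitor = None

    @property
    def active(self) -> bool:
        return self._monitor is not None

    @contextmanager
    def window(self, name: str) -> Iterator[WindowResult]:
        result = WindowResult()
        if self._monitor is None:
            yield result  # nothing to measure; fields stay None
            return
        self._monitor.begin_window(name)
        try:
            yield result
        finally:
            measurement = self._monitor.end_window(name)
            result.joules = float(measurement.total_energy)
            result.seconds = float(measurement.time)
```

A caller would write `with profiler.window("sample_0") as w: ...` and read `w.joules` afterwards, treating `None` as "not measured" rather than as zero energy.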
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `utils/energy_profiling/zeus_profiler.py` | New module providing ZeusProfiler and ZeusWindow classes for optional energy profiling with clean API and graceful degradation when Zeus is unavailable |
| `utils/energy_profiling/__init__.py` | Package initialization exposing ZeusProfiler |
| `train_args.py` | Added command-line arguments for Zeus profiling configuration (enable/disable, GPU/CPU selection, device indices) |
| `train.py` | Integrated Zeus profiler into training workflow, collecting energy metrics during sampling and logging to TensorBoard and metrics files |
| `sample.py` | Added energy measurement to sample generation using Zeus context managers, returning per-sample and average energy consumption |
| `run_exploration_monitor.py` | Added avg_joules_inf column to monitoring UI for experiment tracking |
| `optimization_and_search/run_experiments.py` | Extended METRIC_KEYS and cast array to include avg_joules_inf metric |
| `explorations/energy_efficiency_zeus.yaml` | New exploration configuration for energy efficiency experiments with various model configurations and profiling enabled |
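The table mentions command-line arguments for enabling profiling and selecting devices without listing them. A minimal `argparse` sketch of what such options could look like follows; the flag names `--zeus_profile`, `--zeus_gpu_indices`, and `--zeus_cpu_indices` are assumptions made for illustration and may not match the ones actually added to `train_args.py`.

```python
import argparse


def add_zeus_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Attach hypothetical Zeus profiling flags to an existing parser."""
    group = parser.add_argument_group("zeus energy profiling")
    # BooleanOptionalAction (Python 3.9+) also generates --no-zeus_profile.
    group.add_argument("--zeus_profile", action=argparse.BooleanOptionalAction,
                       default=False, help="measure energy during sampling")
    # None means "let the profiler pick its default devices".
    group.add_argument("--zeus_gpu_indices", type=int, nargs="*", default=None,
                       help="GPU indices to profile")
    group.add_argument("--zeus_cpu_indices", type=int, nargs="*", default=None,
                       help="CPU indices to profile")
    return parser


parser = add_zeus_args(argparse.ArgumentParser())
args = parser.parse_args(["--zeus_profile", "--zeus_gpu_indices", "0", "1"])
```

Defaulting the switch to off keeps existing runs unchanged unless profiling is requested explicitly.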
This pull request introduces energy profiling support using ZeusProfiler throughout the training and inference codebase. The main goal is to enable measurement and reporting of energy consumption (in joules) during model inference and sampling, and to surface this information in experiment results and monitoring tools. The changes span configuration, metrics tracking, sampling, and training scripts.
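To make the "per sample and as an average" reporting concrete, here is a small sketch of how per-sample joule readings might be reduced to a single average. The helper name `average_joules` is hypothetical, and skipping unmeasured samples (represented as `None`) is an assumption about how missing readings should be handled, not something the PR states.

```python
from typing import Optional, Sequence


def average_joules(per_sample: Sequence[Optional[float]]) -> Optional[float]:
    """Mean energy over the samples that were actually measured.

    Returns None when no sample has a reading, so callers can tell
    "profiling was off" apart from "zero joules".
    """
    measured = [j for j in per_sample if j is not None]
    if not measured:
        return None
    return sum(measured) / len(measured)


# Three generated samples, one of which has no reading.
avg = average_joules([10.0, 20.0, None])
```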
Energy Profiling Integration
- Added Zeus energy profiling support to `sample.py` and `train.py`, including new command-line arguments for enabling Zeus profiling and specifying which devices (CPU/GPU) to profile. Energy consumption during inference is now measured and reported per sample and as an average.
- Updated the training workflow to collect and log average inference energy after sampling, including TensorBoard support for tracking this new metric.
Metrics and Monitoring Updates
- Added `avg_joules_inf` as a tracked metric in experiment results, monitoring UI, and metrics parsing logic, ensuring energy consumption is visible in experiment summaries and dashboards.
Configuration Improvements
- Updated the `energy_efficiency_zeus.yaml` exploration configuration to include new static and variation groups relevant for energy profiling experiments, such as precision modes and profiling flags.
These changes lay the groundwork for systematic energy efficiency experiments and monitoring, making it easier to evaluate and compare the energy consumption of different model configurations.
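The metrics change amounts to adding one key and one matching cast to two parallel lists. The sketch below shows that idea on a comma-separated metrics line. Only `avg_joules_inf` and the name `METRIC_KEYS` come from the PR; the other keys, the `METRIC_CASTS` name, the file format, and the `parse_metrics_line` helper are assumptions for illustration.

```python
# Parallel lists: position i in METRIC_CASTS converts the field named by
# METRIC_KEYS[i]. The first two keys are hypothetical placeholders.
METRIC_KEYS = ["best_val_loss", "num_params", "avg_joules_inf"]
METRIC_CASTS = [float, int, float]


def parse_metrics_line(line: str) -> dict:
    """Parse one comma-separated metrics line into a dict keyed by METRIC_KEYS.

    Columns missing from older files (written before avg_joules_inf
    existed) are left as None instead of raising.
    """
    fields = [field.strip() for field in line.split(",")]
    parsed = {key: None for key in METRIC_KEYS}
    for key, cast, raw in zip(METRIC_KEYS, METRIC_CASTS, fields):
        if raw != "":
            parsed[key] = cast(raw)
    return parsed


row = parse_metrics_line("1.25, 1000, 42.5")
```

Appending the new metric at the end of both lists is what keeps older, shorter metrics lines parseable in this sketch.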